Skip to content

some initial tweaks of the input format #132

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 10, 2025
Merged

Conversation

superwhiskers
Copy link
Collaborator

description

this pull request proposes some tweaks to the input format itself, and to the language of the specification

proposed changes

here's a quick summary of the changes made:

  • changes from the nested heterogeneous lists in the initial spec to lists of objects
  • tightening and clarification of language
  • removal of redundant parts of the specification

further comments

  • on line 118, there is a note about adding the variables monitoring is supported for in each device class to the device class table (as is with the nodes, as far as i know). this should be done and the comment should be removed before this is merged.
  • we should support messagepack in the gridkit codebase. this is noted at the top of the spec but i'd like to discuss it prior to merging and at what point to attempt implementing it (and how).
  • i'm not an electrical engineer so i may have messed up some aspects of the language. this should be reviewed (or maybe reverted? most of my edits were made with the intent of making aspects of the language clearer but this may not be necessary)
  • can we remove the case_ prefix from case_name, case_description, case_comments in the header section?
  • i removed the case_date field from the header as it seems to me that this role could be fulfilled by either case_description or case_comments, but was this assumption incorrect?
  • i haven't touched the output file format section of the specification. i'd like to discuss if the role of this could be filled better with a utf8 encoded json schema or something else prior to reworking it.
  • anything else that stands out as wrong or unclear.

@superwhiskers superwhiskers self-assigned this Jun 6, 2025
Copy link
Collaborator

@pelesh pelesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, but I'd let @abirchfield decide whether to merge this.

As an additional comment, it would be good to add another appendix with JSON style guidelines for input files.

@abirchfield
Copy link
Collaborator

One of the goals of the format is that it will be non-repetitive to keep file size concise and enhance human readability. Notice that, with the proposed changes, the demo file has ballooned to about double its prior size. Imagine if we have 2000 buses, and each of them has the identical literal strings "id", "class", "name", "default_voltages", "voltage_base" rather than just an ordered list with known required items. Same thing if we have 300 GENROUs. We will end up with massive files with lots of repetitive text that are neither efficient for humans nor computers to read. It is a stylistic choice but I prefer the original.

@superwhiskers
Copy link
Collaborator Author

superwhiskers commented Jun 8, 2025

json parsers these days are quite fast, i don't think there would be any significant performance overhead to this. the main reason i think objects over arrays for device & node descriptions is the way to go is that they're more self-describing compared to the heterogeneous arrays used previously.

additionally, another change i made was pulling out the optional fields we had already into the object we now use for describing these. this removes our specified optional fields from the same namespace that other consumers of this format would use to implement their own extensions, freeing us from ensuring new optional fields we add don't have names that conflict with those used by others. this is also beneficial, and not something that could've been done with the previous approach unless we had two separate objects in the array: one for us and the other for extensions.

finally, the use of an object over a heterogeneous array is just more in line with people's expectations, but this may not be that convincing of an argument.

@abirchfield
Copy link
Collaborator

I committed some changes. I think I am persuaded to use more of an object-oriented format rather than tables. In going through it, I realized some of the names were not consistent with nomenclature decisions we made since the original draft of the format. @superwhiskers feel free to take another pass through.

@superwhiskers
Copy link
Collaborator Author

so there's one major comment here. in the initial pull request, i removed the case_date for this reason (mentioned in the initial comment):

i removed the case_date field from the header as it seems to me that this role could be fulfilled by either case_description or case_comments, but was this assumption incorrect?

i notice you've added it back. was my assumption incorrect?

@superwhiskers
Copy link
Collaborator Author

i'd also like to start some discussion of the output format. aside from csv, some other alternatives that would work are json lines or perhaps the same thing with messagepack.

this may not be the best choice if streaming output is not a concern we have (the primary benefit of such formats would be the ability to easily stream output from one area to another) but they also provide the benefits of json and messagepack, which is that they are better specified than something like csv, but issues with that may not come up for our use case.

@abirchfield
Copy link
Collaborator

It is typical and convenient to have a separate field in a case file for the case date.

@abirchfield
Copy link
Collaborator

What this would look like in JSON lines? Keep in mind we may have 100k or more time points for some studies. I would not recommend repeating all the output channel names for each time point. I guess the simplest would be just to add a [ and ] to the beginning and end of each of the lines in my example, so that every line is an array representing a time point, except the first line is an array with channel labels.

@abirchfield
Copy link
Collaborator

Of course the first thing I would do for my own workflow is write a script to convert it to CSV so I can open it in Excel...

@superwhiskers
Copy link
Collaborator Author

superwhiskers commented Jun 9, 2025

ah. if your main use case is opening it in excel then sticking with csv would probably be the better choice. json lines or something like that is probably excessive and wouldn't be of any benefit to you. the only reason i bring it up is that i've used it in the past for some simulation work and retaining the output channel names makes every single pipeline slightly more self-describing (the size increase is negligible if the json is compressed). additionally, there are many tools for working with json.

@superwhiskers
Copy link
Collaborator Author

superwhiskers commented Jun 9, 2025

i've explicitly specified a date format (YYYY-MM-DD) so that there's no confusion about the date being represented by a case with this field and marked the field as optional as it wasn't used in the example and it may not always be applicable. i've also removed the "uncommon" label applied to some fields as it wasn't clear to me what that was there to indicate. suggesting that such a field overrides the default value seemed enough to me.

the remaining things i'd like to complete before this is ready to merge / ready for final review are:

  • clarify the format of the output csv and adjust the language
  • explicitly specify default values for the init field of the bus objects when they are not given (this could be added to the bus class table)

@abirchfield
Copy link
Collaborator

Generators commonly specify data on their own power base, overriding the system default. Very rarely would a device override the frequency base, but it could.

@abirchfield
Copy link
Collaborator

User should be able to specify time as well as date. Generally this could be written automatically by a program writing the file.

@superwhiskers
Copy link
Collaborator Author

would it not be more appropriate to call it case_datetime then, or something along those lines? i'll specify that an iso 8601 datetime (with or without timezone) as well as just a date is allowed, then.

@abirchfield
Copy link
Collaborator

The output CSV format is just an example. If you want to add any specifics go ahead.

@abirchfield
Copy link
Collaborator

For default values of each of the bus classes, it will be 1.0 for Vr and Va, and it will be 0.0 for all other variables. This is probably easier said in a note below the table rather than cluttering the table.

@superwhiskers
Copy link
Collaborator Author

my only confusion regarding the existing output format is the meaning of the "solver status" column. it's not clear to me how portable such a field would be. it may be better to just leave that for another output channel (that is, not bundling it into the same file containing samples of all of the monitor channels).

@abirchfield
Copy link
Collaborator

I would prefer all the output be together. Maybe the solution is to add an optional "mon" field to the header section, for system-level outputs. This could include solver status, but could also include various aggregate measurements. Time, of course, would still always be the first channel.

@superwhiskers
Copy link
Collaborator Author

how do you feel about a json lines derived output format like this? each timestep would be like this:

{"time": 0, "solver": { /* solver-dependent object */ }, "buses": [/* bus objects */], "devices": [/*device objects*/]}

the solver object contains status arbitrary status information dependent upon the solver used (as this kind of state can vary between solvers). the bus objects are as follows:

{"number": 0, "monitored": {"variable": /* state */}}

the device objects are as follows:

{"ports": /* same as how they are specified */, "id": /* some value */, "monitored": {/* same as above*/}}

this is a quick sketch. i decided to move back to json lines as i couldn't think of a single good way to render this information down to a csv without recreating all of the niceties of a format like json. suggest any changes you think would improve it.

@abirchfield
Copy link
Collaborator

The basic idea for the output is: you define what variables you want to monitor and the program exports a matrix of those variables across time. CSV is most useful for that, so I at least want a CSV output option. We can add additional, more complicated output formats in the future if they will be useful for anyone.

@superwhiskers
Copy link
Collaborator Author

absolutely. my main thought here was that it didn't seem like i could make a single csv-derived format that would fit most use cases so i propose that json lines format and a tool (a python script or something) that converts it to a csv the way you'd want it as proposing a csv like what you were thinking of seems too specific to make sense proposing as a "standard" output format (i'm mainly thinking of the solver data here).

alternatively, we just cut the output format from this specification and develop one outside of it before adding one to this spec.

@pelesh pelesh merged commit f65eb4f into develop Jun 10, 2025
4 checks passed
@superwhiskers
Copy link
Collaborator Author

we may need to make a separate pull request then to make a tweak or two to the output format as i don't know how well it'd work as-is (primarily talking about the solver status column still).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants